OVERVIEW

We identified potentially highly suitable areas for a group of species triggering criteria A1 (threatened species) and B1 (geographically restricted species), to inform the Key Biodiversity Areas (KBAs) initiative in Canada (see KBAs Canada and the KBA standard). To do that, we first fitted habitat suitability models relating presence-only species records to environmental predictor variables, using the Maxent algorithm implemented in the ENMeval v.2.0.0 package (Kass et al 2021) in R. Then, we used cluster analysis to identify discrete polygons containing most of the highly suitable areas.

We ran the analysis for a group of 96 species as a proof of concept, with the aim of providing recommendations for a full implementation (~1,200 species).

In this report we present these sections:

  • Workflow: describes all the steps we followed to implement HSMs and identify potential areas for KBAs.
  • Results: describes our findings for all species evaluated and offers some recommendations for further steps.
  • Species reports: provides access to detailed results for each species, including maps (e.g., habitat suitability predictions, uncertainty), performance metrics and variable importance, among others.
  • Methods: presents a complete description of the analysis, data sets and tools we used in each step of the process.

Scope of the analysis

  • This large-scale analysis offers a first approximation for identifying potential areas that might harbor KBAs in Canada, for a large group of species of interest (~1,200 species).

  • The spatial resolution of this analysis is ~1 km2.

  • This macro-scale analysis will be followed by expert reviews and local-scale analyses using finer data sets.

  • It will also help prioritize areas for further automated KBA assessments.

  • We used this approach to identify areas not only for species of ‘global interest’ (i.e., triggering A1 and B1 criteria) but also for some species of ‘national interest’ (i.e., species that could trigger national A1 and B1 KBAs in Canada).

  • Because the geographic ranges of some species extend across the border, we collected and analyzed data for North America (Canada, the US, and Mexico) as the study area.

  • The selection of species for this first batch was based on the status of the EBAR maps, as follows:

    • Species with EBAR-maps reviewed by experts (128). We started with this group as a proof of concept (we ran HSMs for 96 species).

    • Species with EBAR-maps partially reviewed by experts (98).

    • Species with EBAR-maps auto-generated (921).

  • This tool was designed to be updated and replicated (e.g., when new species and observations are added to the list). Scripts were developed in R and interactive outputs in R Markdown.

  • Because this tool was designed to offer relevant information to a wide spectrum of users and decision makers, and to facilitate their decision-making, we provide results for each step of the process that may affect model reliability and the usefulness of the final outputs.


WORKFLOW

The workflow presented below describes the steps we followed to identify potential sites for all species triggering Criteria A1 and B1 (see the KBA standard for additional information). We also offer a preliminary analysis of the sensitivity of these areas to climate change scenarios.

We developed a top-down approach that can:

  • be applied to all trigger species.
  • be updated and replicated.
  • offer a quick overview and assessment of potential sites for KBAs.
  • be complemented with bottom-up approaches currently under development.

Figures and maps presented here are only for illustration purposes.

——————————-

1. We started the workflow by asking: where are highly suitable areas located within the geographic range of the species?

To do that, we implemented Habitat Suitability Models (HSMs) to identify highly suitable areas within the geographic range of each species. As a proxy for geographic range, we used ecosystem-based automated range (EBAR) maps developed by NatureServe Canada to bound each HSM (see more information here). EBAR maps are comprised of jurisdiction-approved Ecoshapes (generally Ecodistricts, Level IV Ecoregions, or similar) that are triggered by species observations. EBAR maps were also subject to a thorough process of review by experts for each species of interest.



2. Then, we were interested in knowing whether high suitable areas were clustered or dispersed across the landscape.

This is our first approximation for identifying discrete areas that contain most of the highly suitable habitat and therefore might harbor KBAs.


3. Finally, we evaluated to what extent high suitable areas might change in the future under various climate change scenarios.

This additional evaluation provides decision-makers with a sense of the potential threats to species under different climate change scenarios.

We provide maps identifying where those changes (‘losing’ or ‘gaining’ suitable areas) might occur within the geographic range of each species.


Back to top
——————————-


RESULTS

In this section we present a summary of the general findings from running HSMs for a sample of 96 species triggering criteria A1 (threatened species) and B1 (geographically restricted species), to inform the Key Biodiversity Areas (KBAs) initiative in Canada (see Figure below).

We also want to highlight some of the main challenges we encountered when running multiple HSMs.

Let’s start describing the outputs we obtained from this process.

1. Unable to run HSMs

There is a group of species for which we were unable to run HSMs. We identified three cases:

  • Computer performance limitations (n=34; 35%): we used a PC with 16 GB of RAM and ran HSMs in parallel, using the ENMeval package option. This was not sufficient for some species with broad geographic extents; the computer crashed after 4-5 hours. We recommend moving the analysis for this group of species to a High Performance Computer (HPC). This group is likely to generate reliable models, because these species have a good number of observations.

  • Not enough observations (n=20; 21%): HSMs do not run for species with fewer than 3 observations (at least using the ENMeval package and the Maxent algorithm). We suggest using alternative ways to inform the KBA initiative, such as expert knowledge, current observations and/or critical habitat.

  • Spatial issues between observations and EBARs (n=2; 2%): we were unable to run models for two species. It seems that the observations and EBARs are in different coordinate systems, so they do not overlap. This is something to review with NatureServe.


2. Species with HSMs predictions

For this group of species we were able to run HSMs, and we classified the results based on one of the model performance metrics, the omission rate (OR), as follows:

The omission rate (OR) indicates the “fraction of the test localities that fall into pixels not predicted as suitable for the species. A low omission rate is a necessary (but not sufficient) condition for a good model.” (Phillips et al., 2006). No thresholding rule has yet been developed to determine an optimal omission rate, so we suggest this provisional relative scale (is the model potentially useful?):

  • High concern; OR > 0.50 in model predictions (n=13; 14%). Most of these species had 3 observations; however, some species with 7-10 observations performed similarly poorly. We recommend letting the algorithm run the model (even with 3 observations) and using one of the performance metrics (e.g., OR) to decide model usefulness. We suggest using alternative ways to inform the KBA initiative: expert knowledge, current observations and/or critical habitat.

  • Some concern; OR between 0.25 and 0.50 in model predictions (n=7; 7%). This is a group of species with a low number of observations (6-13) and intermediate OR values. We recommend treating these species also as a ‘concern in model predictions’, and suggest using alternative ways to inform the KBA initiative: expert knowledge, current observations and/or critical habitat.

  • Model predictions can be informative; OR <= 0.25 (n=20; 21%). This is a group of species with a low to high number of observations (8-47) and low OR values. We believe these models can offer relevant information to the KBA initiative. We expect that running the group of species with computer performance limitations (n=34; 35%) on an HPC will increase the percentage of informative models to 40-50%, because those species have large numbers of observations. If true, we could expect at least 500 or so informative HSMs for the entire group of ~1,200 species.
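The triage rule above can be sketched in code. The actual analysis computed omission rates with ENMeval in R; this is an illustrative Python equivalent, where an occurrence counts as "omitted" when its pixel's predicted suitability falls below the chosen threshold, and the provisional scale classifies the resulting OR.

```python
# Illustrative sketch only; the real OR values come from ENMeval in R.

def omission_rate(test_suitabilities, threshold):
    """Fraction of test localities falling in pixels predicted unsuitable."""
    omitted = sum(1 for s in test_suitabilities if s < threshold)
    return omitted / len(test_suitabilities)

def classify_model(or_value):
    """Provisional relative scale used in this report."""
    if or_value > 0.50:
        return "high concern"
    elif or_value > 0.25:
        return "some concern"
    else:
        return "potentially informative"
```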



SPECIES REPORTS

List of species triggering A1 (threatened species) and B1 (geographically restricted species) criteria defined in the KBA standard. You can search for species using common or scientific names (in alphabetical order).

Colors indicate the outcome we obtained after implementing HSMs:

  • Model predictions can be informative.

  • There is some concern in model predictions.

  • High concern in model predictions.

  • Unable to run model predictions: fewer than 3 observations.

  • Unable to run models (not enough memory): the HSM analysis must be moved to a High Performance Computer (HPC).

  • Unable to run model predictions: observations outside EBARs (possibly due to different coordinate systems).

Click on species link (if available) to see results.

Search by: COMMON NAME (Click to expand)

Back to top

Search by: NATIONAL_SCIENTIFIC_NAME (Click to expand)


Back to top

METHODS

In this section we offer a complete description of the methods, tools, data sets and analyses we used in each step of the process.

Figures and maps presented here are only for illustration purposes.

1. HABITAT SUITABILITY MODELS (HSMs)

1.1 Overview (Click to expand)


Introduction

We implemented Habitat Suitability Models (HSMs) to identify highly suitable areas within each species’ geographic range (using EBARs as a proxy, see description below) and to evaluate potential changes in these areas due to climate change.

HSMs are ecological tools that relate species observations (presence or presence-absence) to environmental (biotic or abiotic) predictor variables (e.g., climate, land cover, topography) to predict suitable habitat across a landscape or region (Elith and Leathwick 2009).

We followed, as best as we could, the standard protocols for reporting HSMs proposed by Zurell et al 2020.


Model Objective

This study had two main modelling purposes:

  • Identifying highly suitable areas for all species of interest (i.e., objective = ‘Mapping’ sensu Araújo et al 2019); and,

  • Evaluating changes in habitat suitability under future climate change scenarios (i.e., objective = ‘Transferring’ sensu Araújo et al 2019).


Data type

Observations

We used presence-only data from NatureServe Canada that includes multiple sources and different types of features.

Data sharing agreements

Due to data sharing agreements we will not display in this tool any features representing occurrences used in this analysis.

Similarly, we also excluded the following data from HSMs modelling:

  • Unobscured versions of iNaturalist obscured records.

  • Canadian Wildlife Service records.

  • Records for British Columbia species subject to persecution and harm.

Our coarse analysis and outputs (grid cell size ~1 km2) comply with data providers’ requirements for generalization of data outputs. That is, cells must be generalized to 1000 m in any external products.

Data providers

  • NatureServe Canada (NSC)
  • Alberta Fisheries and Wildlife Management Information System (FWMIS)
  • Alberta Conservation Information Management System (ACIMS)
  • Atlantic Canada Conservation Data Centre (ACCDC)
  • British Columbia Conservation Data Centre (BCCDC)
  • Manitoba Conservation Data Centre (MBCDC)
  • Northwest Territories Conservation Data Centre (NWTCDC)
  • Nunavut Conservation Data Centre (NUCDC)
  • Ontario Natural Heritage Information Centre (NHIC)
  • Centre de Données sur le Patrimoine Naturel du Québec (CDPNQ)
  • Saskatchewan Conservation Data Centre (SKCDC)
  • Yukon Conservation Data Center (YTCDC)
  • NatureServe US
  • Ditto
  • Manitoba Museum Collections
  • Bumble Bees of North America Database
  • iNaturalist Restricted - set to “obscured” and/or “private” in the iNaturalist.ca platform by the contributing citizen scientists or for species which the geoprivacy is automatically set to “obscured” (species subject to persecution and harm)
  • Bumble Bee Watch
  • University of Alaska Museum

Species Geographic range maps

We used ecosystem-based automated range (EBAR) maps developed by NatureServe Canada to bound each HSM (see more information here). EBAR maps are comprised of jurisdiction-approved Ecoshapes (generally Ecodistricts, Level IV Ecoregions, or similar) that are triggered by species observations. EBAR maps were also subject to a thorough process of review by experts for each species of interest.

As of August 2021, this is the list and status of EBARs for species triggering A1 and B1 criteria:

  • Species with EBAR-maps reviewed by experts (128). We started with this group of species as a proof of concept.

  • Species with EBAR-maps partially reviewed by experts (98).

  • Species with EBAR-maps auto-generated (921).

Predictors

We used raster data sets for environmental predictors available online at similar scales and without any spatial gaps in North America.

  • Habitat heterogeneity metrics: 5
  • Dynamic Habitat Index (DHI) metrics: 3
  • Freshwater distance: 2
  • Bioclim variables: 19


Spatial resolution

HSM outputs are generated at ~1 km2 spatial resolution. Note that all maps were projected to North American Albers Equal Area Conic to preserve area.


Taxon/species level

Single species models.


Algorithm

We implemented Habitat Suitability Models (HSMs) to predict suitable habitat across each species’ geographic distribution (Elith and Leathwick 2009). HSMs are ecological tools that relate species observations (presence or presence-absence) to environmental (biotic or abiotic) predictor variables (e.g., climate, land cover, topography). These empirical models rely on correlative relationships between species occurrences and environmental variables (Guisan and Zimmermann 2000).

Regression models (e.g., Generalized Linear Models, GLMs, and Generalized Additive Models, GAMs) and ensemble models (e.g., random forests and boosted regressions) have been used for presence-absence datasets; however, such presence-absence surveys are rare. Presence-only datasets are more frequent in ecology, so alternative methods have been proposed (Wisz et al. 2008). One of these algorithms is the maximum entropy modeling of species geographic distributions algorithm (MaxEnt, see Phillips et al. 2006), available through the ‘maxent’ function in the ‘dismo’ package in R.

MaxEnt has been reported to have high predictive performance compared with other high-performing methods (Hernandez et al. 2006). One of its advantages is that Maxent uses presence-only data and copes well with sparse and irregularly sampled data, circumventing the lack of true presence and absence observations in field surveys (Wisz et al. 2008; Hernandez et al. 2006). Maxent is also “preferable when modelling species that are narrowly distributed and from which not many record locations are available” (Aguirre-Gutierrez et al. 2013; Hernandez et al. 2006). In addition, the habitat suitability output is a continuous map, allowing for the distinction between highly and poorly suitable areas for the species (Phillips et al. 2006).

The principle, from Phillips et al. (2006):

“When approximating an unknown probability distribution, the question arises, what is the best approximation? E.T. Jaynes gave a general answer to this question: the best approach is to ensure that the approximation satisfies any constraints on the unknown distribution that we are aware of, and that subject to those constraints, the distribution should have maximum entropy (Jaynes, 1957). This is known as the maximum-entropy principle.”

For additional information on Maxent see Elith et al. 2013, Merow et al. 2013, Phillips et al. 2006 and Phillips et al. 2017.

We used the ENMeval v.2.0.0 package (Kass et al 2021; see the ENMeval Vignette here) to implement the Maxent algorithm. ENMeval provides functionality for model tuning, offers methods for partitioning occurrence data, and reports multiple performance metrics (Kass et al 2021).


Extent of the analysis

We used the high resolution map of North America from Global Self-consistent, Hierarchical, High-resolution Geography Database (GSHHG):

  • GSHHS_shp: shapefiles (polygons) derived from shorelines, named GSHHS_<resolution>_L<level>. We used the full resolution (f), which contains the maximum resolution of the data and has not been decimated.
    • Level 1: Continental land masses and ocean islands, except Antarctica.

We considered the entire geographic range of all species, including data sets for North America for those species whose ranges extend across the Canadian border (see an example below).

Back to top

1.2 Data (Click to expand)

1.2.1. Occurrences: presence-only data

Access to data sets: NatureServe Canada data, August/2020- October/2021.


a. Type of data

We used data sets in different formats.

  • Points: all points data.

  • Polygons: element occurrences (EOs) and critical habitat.

    • Element occurrences (EOs): EOs represent an area of land and/or water in which a species or natural community is, or was, present (see NatureServe for details).

    • Critical habitat: “Critical habitat is defined under section 2 of SARA as: ‘the habitat that is necessary for the survival or recovery of a listed wildlife species and that is identified as the species’ critical habitat in the recovery strategy or in an action plan for the species’. Section 49(1)(a) of SARA requires that a species’ Recovery Strategy/Action Plan include an identification of the species’ critical habitat to the extent possible, based on the best available information, including information provided by the Committee on the Status of Endangered Wildlife in Canada (COSEWIC)” (see Species at Risk Act, SARA).

  • Lines: transect surveys.


b. Converting polygons and lines to points

HSM algorithms require observations as point features, so polygons and lines were converted to raster using a raster template at ~50 m spatial resolution (for lines we first applied a 25 m buffer to convert them to polygons). Then, we converted these grid cells to points, using a ‘centroid’ option within each ~1 km2 grid cell. This procedure partially corrects for sampling bias (Fourcade et al 2014).

Finally, we joined all these point sources where available.


c. Spatial filtering

To reduce spatial sampling issues and spatial autocorrelation in our HSM estimates, we thinned the occurrence data set of each species using a thinning algorithm (implemented in the spThin package by Aiello-Lammens et al 2015) with a distance of 5 km (we tested distances of 1, 3 and 5 km). Spatial filtering can also improve model performance (Boria et al 2014).
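The core idea of distance-based thinning can be sketched as follows. The actual analysis used the spThin R package, which randomizes which points are removed; this illustrative Python variant uses a simpler greedy rule, keeping a point only if it lies at least `min_km` from every point already kept.

```python
import math

def haversine_km(p1, p2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def thin(points, min_km=5.0):
    """Greedy thinning sketch (spThin instead randomizes removals)."""
    kept = []
    for p in points:
        if all(haversine_km(p, q) >= min_km for q in kept):
            kept.append(p)
    return kept
```

Because the greedy rule depends on input order, it keeps a valid but not necessarily maximal set of points, which is why spThin repeats the procedure with randomization.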


1.2.2. Predictors

We included four groups of predictors with data available across North America: topographic heterogeneity, dynamic habitat indices, climate and distance to water.

a. Topographic heterogeneity (TH)

We used indices of topographic heterogeneity developed by Amatulli et al 2018, downloaded from this website (access dates in the table below).

Derived topographic continuous variables (see description here):

DATASET | AGGREGATION | SOURCE | RESOLUTION | DATE DOWNLOADED
Vector Ruggedness Measure (VRM) | Mean | GMTED2010 | 1 km | September 17, 2020
Roughness | Mean | GMTED2010 | 1 km | September 17, 2020
Eastness | Mean | GMTED2010 | 1 km | August 2, 2021
Northness | Mean | GMTED2010 | 1 km | August 2, 2021
Slope | Mean | GMTED2010 | 1 km | August 2, 2021

GMTED2010: Global Multi-resolution Terrain Elevation Data 2010 dataset (250m resolution).

  • Vector Ruggedness Measure (VRM): it “quantifies terrain ruggedness by measuring the dispersion of vectors orthogonal to the terrain surface. Slope and aspect are decomposed into 3-dimensional vector components (in the x, y, and z directions) using standard trigonometric operators, and by calculating the resultant vector magnitude within a user-specified moving window size (in this study 3 × 3).” (Amatulli et al 2018).

  • Roughness: “Roughness is expressed as the largest inter-cell difference of a focal cell and its 8 surrounding cells.” (Amatulli et al 2018).

  • Northness: “a northness value close to 1 corresponds to a northern exposition on a vertical slope (i.e., a slope exposed to very low amount of solar radiation), while a value close to -1 corresponds to a very steep southern slope, exposed to a high amount of solar radiation.”

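The roughness definition quoted above is simple enough to sketch directly. The published layers were computed from GMTED2010 at scale; this toy Python version, written only for illustration, applies the same rule (largest absolute difference between a focal cell and its 8 neighbours) to a small elevation grid given as a list of lists.

```python
def roughness(dem, row, col):
    """Largest inter-cell difference between a focal cell and its 8 neighbours."""
    focal = dem[row][col]
    diffs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == dc == 0:
                continue  # skip the focal cell itself
            r, c = row + dr, col + dc
            if 0 <= r < len(dem) and 0 <= c < len(dem[0]):
                diffs.append(abs(dem[r][c] - focal))
    return max(diffs)
```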

a.1. Why is topographic heterogeneity important?

TH is expected to increase environmental gradients and create multiple ecological niche opportunities. TH can influence sub-regional variation in abiotic factors (micro/macroclimates, soil composition, hydrological systems) and biotic factors (number of species, population dynamics and species movements, among others; Amatulli et al 2018). In addition, TH can provide shelter and refuge for species and/or small populations during periods of adverse environmental conditions and/or drastic climate change. It has been suggested that heterogeneous environments, by offering more potential niches, can promote the coexistence of species and therefore might affect the number of species found at a site (the habitat heterogeneity hypothesis; Pianka 1966, Cramer and Willig 2004, and empirical evidence in Stein et al 2014), among other factors (e.g., Allouche et al 2012).

Studies have also suggested that TH, in concert with other environmental variables, can affect geographic range size (Janzen 1967 and McCain 2009), species turnover (Buckley and Jetz 2008 and Fjeldså et al 2012), endemism (Steinbauer et al 2016) and life-history characteristics (Bears et al 2009).

Slope aspect can significantly influence microclimate and hydrological process that might result in distinct vegetation types, species turnover and species habitats (Yang et al 2020).


b. Dynamic Habitat Index (DHI)

The Dynamic Habitat Index (DHI) is an integrated metric of vegetation production from satellite imagery that measures the fraction of photosynthetically active radiation (fPAR) intercepted by vegetation (Coops et al 2008). The DHI integrates three measures of vegetation productivity: a) cumulative annual productivity, b) the minimum level of vegetation cover, and c) the degree of seasonality (Radeloff et al 2019). These indices are strongly correlated at the global scale; however, local differences emerge among indices when considering the geographic ranges of the species.

Access date: November 2019. Spatial resolution: 0.0083 decimal degrees (~1 km). Datasets downloaded from the Silvis Lab (DHI).

For each index we stacked all years (2003-2015) and calculated the mean.


b1. Cumulative annual productivity (CAP, or total greenness, band_1)

CAP has been shown to accurately predict species range size (Nilsen et al 2005), species richness (Hobi et al 2017 and Radeloff et al 2019), beta diversity (Andrew et al 2012), species abundances (Michaud et al 2014) and productivity across various cover types (Coops et al 2008). It has also been used in conservation planning (Powers et al 2012).



b2. Minimum Annual Productivity (MAP, band_2)

“The continual provision of food and habitat resources throughout the year is of particular interest to wildlife conservationists, as changes in the amount and quality of available vegetative cover influences behavior of many herbivorous species, and ultimately, the carnivorous species which prey upon them (Schwartz et al. 2006). Locations without significant snow cover in autumn will maintain a cover of green biomass into winter providing accessible food resources and habitat.” (Coops et al 2008).

b3. Variation of annual productivity (band_3)

“Sites with low seasonality are indicative of areas that have consistent vegetation production throughout the year such as evergreen forests” (Coops et al 2008).


c. Climate

Nineteen (19) bioclimatic variables from Worldclim (Hijmans et al 2005) were used in this analysis for both ‘Historical’ and ‘Projected’ climate.

  • BIO1: Annual Mean Temperature
  • BIO2: Mean Diurnal Range (Mean of monthly (max temp - min temp))
  • BIO3: Isothermality (BIO2/BIO7) (* 100)
  • BIO4: Temperature Seasonality (standard deviation *100)
  • BIO5: Max Temperature of Warmest Month
  • BIO6: Min Temperature of Coldest Month
  • BIO7: Temperature Annual Range (BIO5-BIO6)
  • BIO8: Mean Temperature of Wettest Quarter
  • BIO9: Mean Temperature of Driest Quarter
  • BIO10: Mean Temperature of Warmest Quarter
  • BIO11: Mean Temperature of Coldest Quarter
  • BIO12: Annual Precipitation
  • BIO13: Precipitation of Wettest Month
  • BIO14: Precipitation of Driest Month
  • BIO15: Precipitation Seasonality (Coefficient of Variation)
  • BIO16: Precipitation of Wettest Quarter
  • BIO17: Precipitation of Driest Quarter
  • BIO18: Precipitation of Warmest Quarter
  • BIO19: Precipitation of Coldest Quarter
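Several of the bioclim variables above are derived from the others, per the definitions in parentheses (BIO7 = BIO5 - BIO6; BIO3 = BIO2/BIO7 × 100). As an illustration, and not part of the actual pipeline (WorldClim ships these as ready-made rasters), the temperature-derived subset can be computed from 12 monthly maxima and minima for a single cell:

```python
import statistics

def bioclim_subset(tmax, tmin):
    """tmax/tmin: 12 monthly values (degrees C) for one cell."""
    bio5 = max(tmax)                     # BIO5: max temp of warmest month
    bio6 = min(tmin)                     # BIO6: min temp of coldest month
    bio7 = bio5 - bio6                   # BIO7: temperature annual range
    bio2 = statistics.mean(hi - lo for hi, lo in zip(tmax, tmin))  # mean diurnal range
    bio3 = bio2 / bio7 * 100             # BIO3: isothermality
    return {"BIO2": bio2, "BIO3": bio3, "BIO5": bio5, "BIO6": bio6, "BIO7": bio7}
```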


c.1. Historical climate

  • Time period: Historical data for 1970-2000
  • Version: WorldClim v 2.1
  • Access date: June, 2020
  • Spatial resolution: 30 seconds (~1 km)


c.2. Projected climate

  • Time period: Future 2050 (average for 2041-2060)
  • Access date: June, 2020
  • Spatial resolution: 30 seconds (~1 km)
  • Datasets can be downloaded from Worldclim here

For each projected bioclimatic variable (n=19), we downloaded data for three Global Circulation Models (GCMs) and two Representative Concentration Pathways (RCPs), as described below. Then, for each variable, we averaged all GCMs within each RCP; this average was used to implement HSMs for future climate scenarios. In addition, as a measure of uncertainty, we calculated the coefficient of variation (cv, a measure of dispersion among the GCMs) for each bioclim variable. We expect low cv values across space for all bioclim variables, which would indicate low uncertainty and high reliability of the data sets.
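The per-cell ensemble summary described above can be sketched as follows. The real inputs are rasters processed in R; in this illustrative Python version each GCM "layer" is a flat list of cell values, and cv = sd / mean.

```python
import statistics

def ensemble_summary(gcm_layers):
    """gcm_layers: one equal-length list of cell values per GCM.
    Returns (per-cell ensemble means, per-cell coefficients of variation)."""
    means, cvs = [], []
    for cell_values in zip(*gcm_layers):
        m = statistics.mean(cell_values)
        sd = statistics.stdev(cell_values)
        means.append(m)
        cvs.append(sd / m if m != 0 else float("nan"))
    return means, cvs
```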

Collaborative framework:

We used the Coupled Model Intercomparison Project Phase 5 (CMIP5), a collaborative framework aimed at improving knowledge of climate change (CMIP). We will update this as soon as the new CMIP6 datasets are available online.

Global Circulation Models (GCMs):

GCMs are mathematical models of the general circulation of the atmosphere that allow predictions of future climate change scenarios (GCMs). We screened and selected GCMs based on previously published evaluations (e.g., Sheffield et al 2013) and on the availability of information for all bioclimatic variables. These are the selected GCMs:

  • Community Climate System Model version 4 (CCSM4)
  • Hadley Centre Global Environment Model version 2-Earth System (HadGEM2-ES)
  • Model for Interdisciplinary Research on Climate - Earth System (MIROC-ESM)

Representative Concentration Pathways (RCPs):

RCPs are greenhouse gas concentration (not emissions) trajectories adopted by the IPCC to describe different climate futures (as shown in the graph below; see data here).

We used two contrasting RCPs for the analysis:

  • RCP4.5: more or less stable throughout the century (radiative forcing is stabilised at approximately 4.5 W m-2)
  • RCP8.5: extreme scenario (radiative forcing reaches greater than 8.5 W m-2 by 2100 and continues to rise for some amount of time).


c.3. Why is climate important?

Climate has been shown to be an important driver of species distributions (e.g., Araujo et al 2005, Davies et al 2011, and Vásquez & Currie 2014), and climate change might threaten conservation areas and biodiversity (Araujo et al 2011 and Newbold 2018).


d. Water

We recognized the importance of freshwater bodies for species presence, so we constructed two metrics using lakes from HydroLAKES v1.

d1. Distance to water

We calculated the distance from every grid cell in the study region to the nearest grid cell representing a lake.

d2. Percentage of lake cells within each 1 km2 grid cell

We calculated the percentage of finer-resolution grid cells that represent lakes within each ~1 km2 grid cell.
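The two freshwater metrics can be sketched in code. The real layers were derived from HydroLAKES on a ~1 km grid in R; this illustrative Python version takes a small 2D 0/1 lake grid, returns distances in cell units (multiply by the cell size for km), and computes the lake percentage over a grid representing the finer cells inside one coarse cell.

```python
import math

def distance_to_water(lake):
    """Euclidean distance (in cell units) from each cell to the nearest lake cell."""
    lakes = [(r, c) for r, row in enumerate(lake)
             for c, v in enumerate(row) if v == 1]
    dist = []
    for r, row in enumerate(lake):
        dist.append([min(math.hypot(r - lr, c - lc) for lr, lc in lakes)
                     for c in range(len(row))])
    return dist

def lake_percentage(lake):
    """Percentage of cells in the (sub)grid that are lake."""
    cells = [v for row in lake for v in row]
    return 100.0 * sum(cells) / len(cells)
```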

Back to top
1.3 Model fitting (Click to expand)

1.3.1. Multicollinearity

We used a correlation coefficient threshold of 0.7 (Dormann et al 2012) to select a set of uncorrelated variables, using the removeCollinearity function from the virtualspecies package in R.
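The logic of this screen can be sketched as follows. The analysis itself used removeCollinearity() from the virtualspecies R package; this illustrative Python equivalent greedily drops any variable whose absolute Pearson correlation with an already-retained variable exceeds 0.7.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_uncorrelated(variables, threshold=0.7):
    """variables: dict name -> list of values. Keeps first-seen variables."""
    kept = {}
    for name, values in variables.items():
        if all(abs(pearson(values, v)) <= threshold for v in kept.values()):
            kept[name] = values
    return list(kept)
```

Note that the result depends on the order in which variables are considered; removeCollinearity() handles this by sampling and grouping correlated variables rather than dropping them greedily.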


1.3.2. Evaluating multiple models

We used the ENMeval package (Muscarella et al 2014) to evaluate a series of candidate models with various settings (see the ENMeval Vignette here), including multiple feature classes (FCs; adding more FCs enables more complexity) and regularization multipliers (RMs) to minimize overfitting (penalizing “the inclusion of additional parameters that result in little or no ‘gain’ to the model”; Muscarella et al 2014). The ENMeval package implements the MaxEnt algorithm.


1.3.3. Number of observations and model performance

Model building (e.g., type of features, partition method; see next section) and model performance are mediated by the number of species occurrences available.

Although Maxent has been reported to perform well among various algorithms (Hernandez et al. 2006) and is less sensitive to sample size, there is also some concern that this algorithm (and others) does not predict “consistently well with small sample size (n < 30) and this should encourage highly conservative use of predictions based on small sample size and restrict their use to exploratory modelling” (Wisz et al 2008).

Some studies have reported that the minimum number of occurrences required to obtain a good model fit ranges from 14 for range-restricted species to 25 for widespread species (van Proosdij et al 2015). Others, based on empirical studies, have recommended 13 as the minimum sample size (Randal, personal communication).

There is also a general rule of thumb that the minimum sample size (number of occurrences from presence-only data sets) should be 10 times the number of predictor variables to reduce model overfitting (Harrell et al 1996). However, for species with a small number of occurrences, maintaining this 1:10 ratio is quite challenging (e.g., a species with 30 occurrences would allow only 3 predictors). A promising area of study is the implementation of ensembles of small bivariate models (Breiner et al 2015), which can circumvent the limitation of a small number of occurrences. For now, standard Maxent methods still seem to perform well for this exploratory analysis.

Here, we allowed the algorithm to run the analysis and identified model usefulness based on performance metrics.


1.3.4. Model settings

a. Optimization of model parameters

Because it is computationally prohibitive to include all potential feature combinations within a model, we constructed models including as much complexity as we could (i.e., combining FCs and RMs). For instance, in complex models we combined 5 FCs and 3 RMs, for a total of 15 models.
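The tuning grid works as a simple cross product: every combination of a feature-class setting and a regularization multiplier becomes one candidate model. The sketch below illustrates the counting with hypothetical example settings (the FC and RM values shown are placeholders, not the exact settings used in the analysis).

```python
import itertools

# Hypothetical example settings; 5 FCs x 3 RMs = 15 candidate models,
# matching the count described above.
feature_classes = ["L", "LQ", "LQH", "LQHP", "LQHPT"]
reg_multipliers = [1, 2, 3]

candidate_models = list(itertools.product(feature_classes, reg_multipliers))
```

In ENMeval this grid is passed via the tune.args argument, and each candidate model is then evaluated with the chosen partitioning scheme and performance metrics.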

d1. Feature types

“The idea of Maxent is to estimate a target probability distribution by finding the probability distribution of maximum entropy (i.e., that is most spread out, or closest to uniform), subject to a set of constraints that represent our incomplete information about the target distribution. The information available about the target distribution often presents itself as a set of real-valued variables, called “features”, and the constraints are that the expected value of each feature should match its empirical average (average value for a set of sample points taken from the target distribution). When Maxent is applied to presence-only species distribution modeling, the pixels of the study area make up the space on which the Maxent probability distribution is defined, pixels with known species occurrence records constitute the sample points, and the features are climatic variables, elevation, soil category, vegetation type or other environmental variables, and functions thereof." (Phillips et al 2006).

In summary, “features” constrain the computed probability distribution and represent ecological assumptions (i.e., predictors that constrain the geographical distribution of the species; Phillips et al 2006).

The available feature types are listed below (descriptions from Li et al 2020):

  • Linear (L): “Linear features constrain the output distribution for each species to have the same expectation of each of the continuous climate variables as the sample locations for that species. A linear feature is simply one of the continuous climate variables”.

  • Quadratic (Q): “Quadratic features (when used together with linear features) constrain the output distribution to have the same expectation and variance of the climate variables as the samples. A quadratic feature is the square of one of the continuous climate variables”.

  • Product (P): “A product feature is the product of two continuous climate variables; when used with linear and quadratic features, product features constrain the output distributions to have the same covariance for each pair of climate variables as the samples”.

  • Threshold (T): "A threshold feature is derived from a continuous climate variable. For a threshold value v, the threshold feature is binary (taking values 0 and 1) and is 1 when the variable has value greater than v. The effect of a threshold feature is to make the total probability of grid cells with a value greater than the threshold be equal to the fraction of sample locations with the value above the threshold v".

  • Hinge (H): “A hinge feature is also derived from a continuous environmental variable. It is like a linear feature, but it is a constant below a threshold v”.

d2. Feature types and observations

We selected the feature types according to the number of observations (Merow et al 2013), as follows:

Feature type     Number of observations
None             1
Linear (L)       >2
Quadratic (Q)    >10
Hinge (H)        >15
Threshold (T)    >80
Product (P)      >80

d3. Feature combinations (FCs)

  • Less than 10 observations (L)
  • 11 - 15 observations (L, Q, LQ)
  • 16 - 80 observations (L, Q, LQ, H, QH)
  • More than 80 observations (Q, H, LQ, LQP, QPT).

Note 1: “Linear features are special cases of hinge features, so it is redundant to use L and H features simultaneously” (Phillips and Dudík 2008).
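The selection rule above can be written as a small chooser. The analyses were run in R with ENMeval; this Python sketch is only illustrative, and because the list leaves exactly 10 observations unstated, the sketch groups that case with the small-sample rule:

```python
def feature_combinations(n_obs: int) -> list:
    """Candidate Maxent feature combinations (FCs) by number of
    observations, following the d3 rule above. Exactly 10 observations
    is unstated in the report; here it falls in the small-sample case."""
    if n_obs <= 10:
        return ["L"]
    if n_obs <= 15:
        return ["L", "Q", "LQ"]
    if n_obs <= 80:
        return ["L", "Q", "LQ", "H", "QH"]
    return ["Q", "H", "LQ", "LQP", "QPT"]
```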

Remember that the number of parameters retained in the final best model is a function of the number of features created during model fitting, which in turn depends on the number of observations and predictors. For instance:

  • Linear features: one per predictor

  • Product features: one per pair of predictors, i.e., predictors! / (2!(predictors − 2)!) (the number of predictor pairs)

  • Feature combinations can lead to a high number of features created in model fitting, and therefore more parameters in the model. “For example, if there are 100 presence observations the number of piecewise features is given by: 3 (types of piecewise functions) × 99 (pairs of data points) × 19 (predictors) = 5643. This collection of features is explored by MaxEnt, and the most useful features are extracted.” (Merow et al 2013).
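The two feature counts above can be checked directly. This is an illustrative Python sketch (function names are ours, not from the report or from MaxEnt):

```python
from math import comb

def n_piecewise_features(n_obs: int, n_predictors: int, n_types: int = 3) -> int:
    """Piecewise features explored during fitting (Merow et al 2013):
    types x (observations - 1 pairs of data points) x predictors."""
    return n_types * (n_obs - 1) * n_predictors

def n_product_features(n_predictors: int) -> int:
    """Product features: one per pair of predictors, i.e. C(p, 2)."""
    return comb(n_predictors, 2)

# Merow et al's example: 100 presences and 19 predictors.
piecewise = n_piecewise_features(100, 19)  # 3 * 99 * 19 = 5643
```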

d4. Regularization multiplier (RM)

The RM prevents model over-fitting and smooths the distribution, making it more regular.

“A smaller RM value will result in a more localized output distribution that fits the given occurrence records better, but is more prone to overfitting. Overfitted models are poorly transferable to novel environments and, thus, not appropriate to project distribution changes under environmental change. A larger RM value will produce a prediction that can be applied more widely.” (Li et al 2020).

“More complex feature classes will require more regularization to yield accurate predictions.” Phillips and Dudík (2008)

We used regularization values (RMs) proposed by Phillips and Dudík (2008), based on the number of observations. We evaluated these RMs:

  • 0.05
  • 0.50
  • 1.00 (Default in MaxEnt)

d5. Data partitioning method

  • Cross-validation (more than 25 observations; k-folds = 10).
  • K-1 jackknife (fewer than 25 observations).
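The partitioning rule can be sketched as a chooser. The analyses ran in R with ENMeval; this Python sketch is illustrative only, and the labels mirror ENMeval's partition names ('randomkfold', 'jackknife') as an assumption:

```python
def partition_method(n_obs: int) -> dict:
    """Data-partitioning scheme by sample size (section d5):
    10-fold cross-validation above 25 observations, otherwise a
    leave-one-out (k - 1) jackknife."""
    if n_obs > 25:
        return {"method": "randomkfold", "kfolds": 10}
    return {"method": "jackknife"}
```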

d6. Model complexity

Note that it is computationally prohibitive to include all potential feature combinations within the ENMeval evaluation (i.e., more complexity). We therefore included and evaluated as much complexity as we could. As a result, the most complex set of candidate models comprised 15 models, combining 5 FCs (Q, H, LQ, LQP, QPT) and 3 RMs (0.05, 0.5, 1).
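The candidate set is simply the cross product of FCs and RMs. As an illustrative Python sketch (the actual tuning grid was passed to ENMeval in R):

```python
from itertools import product

# Most complex case: 5 feature combinations x 3 regularization multipliers.
FCS = ["Q", "H", "LQ", "LQP", "QPT"]
RMS = [0.05, 0.5, 1.0]

candidates = [{"fc": fc, "rm": rm} for fc, rm in product(FCS, RMS)]
# 15 candidate models in total.
```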

d7. The effect of number of background points and sampling bias

We selected the best model within each of the six groups of models described below (two background-sampling treatments × three background-point levels) based on the lowest AICc value (i.e., six best models). From this list of six models we then selected the overall ‘best model’ based on the highest AUC value. Further analyses were based on this ‘best model’.

  • Background points (bg_points).

    • If the number of grid cells was smaller than 100,000, we tested 20%, 40% and 60% of the grid cells as background points.
    • If the number of grid cells was greater than 100,000, we tested 5%, 10% and 20% of the grid cells as background points.
  • Minimizing sampling bias. Sampling bias is the result of different sampling efforts across environmental space, which can produce models that predict sampling effort rather than habitat suitability (Moua et al 2020). Phillips et al (2009) proposed that choosing background data with the same bias as the presence data can minimize sampling bias. Here we constructed a sampling bias layer from the observation points (i.e., a two-dimensional kernel density estimate, using the kde2d function from the MASS package).

    • No_minimizing_bias: the sampling area for ‘Background points’ included grid cells from the entire geographic range.
    • Minimizing_bias: the sampling area for ‘Background points’ included only grid cells from the bias layer for the species.
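The background-point rule above can be expressed compactly. This Python sketch is illustrative (the report drew the actual background samples in R); the boundary case of exactly 100,000 cells is unstated in the report and is grouped here with the large-range rule:

```python
def background_fractions(n_cells: int) -> list:
    """Fractions of range grid cells tested as background points:
    smaller geographic ranges get larger fractions (section d7).
    Exactly 100,000 cells is grouped with the large-range case."""
    if n_cells < 100_000:
        return [0.20, 0.40, 0.60]
    return [0.05, 0.10, 0.20]
```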

1.4 Model Assessment

a. Approach

We implemented a sequential method to select the best model using two performance metrics. This approach uses the cross-validation results, selecting the models with the lowest average test omission rate and, to break ties, the highest average validation AUC (Kass et al 2020; Radosavljevic & Anderson 2014). Further analyses were based on this ‘best model’.
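The sequential selection reduces to a single sort key. This Python sketch is illustrative (selection was done on the ENMeval results table in R); the metric names match the columns reported below (or.10p.avg, auc.val.avg), while the function name and toy numbers are ours:

```python
def select_best(models: list) -> dict:
    """Sequential selection: lowest average test omission rate
    ('or.10p.avg'); ties broken by highest validation AUC
    ('auc.val.avg')."""
    return min(models, key=lambda m: (m["or.10p.avg"], -m["auc.val.avg"]))

results = [
    {"fc": "LQ",  "rm": 1.0,  "or.10p.avg": 0.10, "auc.val.avg": 0.81},
    {"fc": "QPT", "rm": 0.5,  "or.10p.avg": 0.10, "auc.val.avg": 0.86},
    {"fc": "H",   "rm": 0.05, "or.10p.avg": 0.22, "auc.val.avg": 0.90},
]
best = select_best(results)  # LQ and QPT tie on OR; QPT has higher AUC
```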

  • Omission rate (OR) indicates the “fraction of the test localities that fall into pixels not predicted as suitable for the species. A low omission rate is a necessary (but not sufficient) condition for a good model.” (Phillips et al 2006). No thresholding rule has yet been developed to determine the optimal omission rate, so we suggest this provisional relative scale (is the model potentially useful?): 0 - 0.25, the model can be informative; 0.25 - 0.50, there is some concern about model predictions; > 0.50, there is high concern about model predictions. Omission rate values (or.10p.avg).

  • AUC values are “the probability that a random positive instance and a random negative instance are correctly ordered by the classifier.” (Phillips et al 2006). AUC values above 0.75 are considered potentially useful (Elith 2002; Phillips and Dudík 2008). AUC values (auc.val.avg).
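The provisional omission-rate scale can be sketched as a classifier. This is illustrative only (the report's scale leaves the boundaries 0.25 and 0.50 ambiguous; here they are grouped with the lower band), and the function name is ours:

```python
def omission_concern(or_value: float) -> str:
    """Provisional relative scale for the average test omission rate
    (boundary values are grouped with the lower band)."""
    if or_value <= 0.25:
        return "informative"
    if or_value <= 0.50:
        return "some concern"
    return "high concern"
```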

The graph below compares the average test omission rate (y-axis) across multiple regularization values (rm, x-axis). Features (groups in colors) with NA values are not shown in the graph; see the summary tables below for details.

  • Columns represent controlling for sampling bias
    • Panels in first column = using bias layer
    • Panels in second column = No_bias layer
  • Rows represent testing various background points = 20%, 40% and 60% of grid cells from the geographic range


The graph below compares the average validation AUC (y-axis) across multiple regularization values (rm, x-axis). Features (groups in colors) with NA values are not shown in the graph; see the summary tables below for details.

Note that AUC values above 0.75 (blue dashed line) are considered potentially useful (Elith 2002; Phillips and Dudík 2008).


b. Variable importance

  • Permutation importance: “for each environmental variable in turn, the values of that variable on training presence and background data are randomly permuted. The model is reevaluated on the permuted data, and the resulting drop in training AUC is shown in the table, normalized to percentages.” (from ‘dismo’).

An example is presented below for variable importance:



c. Best model (parameters)

We used the eval.models function from the ENMeval package to identify the best model.

The table below presents a summary of model parameters for the best model. We used this model for further analyses.

1.5 Model predictions

We used the eval.predictions function from ENMeval to create the habitat suitability map from the best model (see figure below).

1.6 Uncertainty map

The map shows a measure of uncertainty based on the coefficient of variation among the 10 habitat suitability maps generated during model fitting (one per cross-validation fold).
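The coefficient of variation is computed cell by cell across the stack of fold-level suitability maps. The actual maps are rasters handled in R; this pure-Python sketch on a toy stack is only illustrative, and all names are ours:

```python
from statistics import mean, pstdev

def cv_map(stack: list) -> list:
    """Cell-wise coefficient of variation (sd / mean) across a stack of
    habitat suitability maps (each map a list of equal-length rows)."""
    n_rows, n_cols = len(stack[0]), len(stack[0][0])
    cv = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            cell = [layer[r][c] for layer in stack]
            m = mean(cell)
            cv[r][c] = pstdev(cell) / m if m > 0 else 0.0
    return cv

# Toy stack of three 1x2 suitability maps.
stack = [[[0.2, 0.4]], [[0.2, 0.8]], [[0.2, 0.6]]]
cv = cv_map(stack)  # identical values -> CV 0; variable values -> CV > 0
```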

1.7 R packages

We used R version 3.6.3 [64-bit] for all analyses (R Core Team, 2018) and the following R packages:


2. POTENTIAL AREAS IDENTIFICATION

We used kernel density analysis (three radii) to evaluate whether highly suitable areas (HSA = 0.75 to 1.00) were clumped, allowing the identification of discrete potential sites to harbor KBAs.

We applied a kernel density estimator by transforming highly suitable areas into points and computing an isotropic kernel intensity estimate of the point pattern. We evaluated various radii (3 km, 4 km and 5 km) to calculate the density of points within each radius window.

We tested here the highest percentile of the HSA map (HSA = 0.75 to 1.00).

Preliminary analysis showed that the 5 km window better differentiated clusters within the landscape.
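The points-and-radius idea can be sketched as follows. The report computed an isotropic kernel intensity estimate in R; this Python sketch substitutes a simpler count-within-radius density on a toy raster (one point per ~1 km² highly suitable cell), and every name here is ours:

```python
from math import hypot

def hsa_points(raster: list, threshold: float = 0.75, cell_km: float = 1.0) -> list:
    """Convert highly suitable cells (value >= threshold) into point
    coordinates in km, one point per grid cell."""
    return [(i * cell_km, j * cell_km)
            for i, row in enumerate(raster)
            for j, value in enumerate(row)
            if value >= threshold]

def disc_density(points: list, radius_km: float = 5.0) -> list:
    """Simplified density: for each point, the number of HSA points
    (including itself) within the radius window."""
    return [sum(1 for q in points
                if hypot(p[0] - q[0], p[1] - q[1]) <= radius_km)
            for p in points]

# Toy 2x2 suitability raster: three cells exceed the 0.75 threshold.
points = hsa_points([[0.9, 0.2], [0.8, 0.95]])
density = disc_density(points, radius_km=1.5)
```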


R packages

4. CLIMATE CHANGE EFFECT

The effects of climate change on species distributions have been broadly documented (e.g., McKenney et al 2011, Pacifici et al 2015), as has the ability of species to track climate change (e.g., Schloss et al 2012). Therefore, highly suitable areas identified by HSMs under historical conditions (current climate) may differ in the future.

We compared models using current climate with two projected climate scenarios:


We calculated the spatial distribution of the major climate changes within the geographic range of each species, using this formula:

HSA change = current HSA - projected HSA.

Areas losing suitability are shown in red, while blue areas indicate gains in suitability in the future. Yellow (zero values) indicates no change in the suitable areas.
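The change map is a cell-wise difference of the two suitability surfaces. The real computation is on rasters in R; this toy Python sketch just illustrates the formula above, with all names and values ours. Note that under this sign convention, positive cells mark declining suitability and negative cells mark gains:

```python
def hsa_change(current: list, projected: list) -> list:
    """HSA change = current HSA - projected HSA (formula above).
    Positive cells: suitability declines; negative cells: gains;
    zero: no change."""
    return [[cur - proj for cur, proj in zip(row_c, row_p)]
            for row_c, row_p in zip(current, projected)]

current = [[0.9, 0.2], [0.5, 0.7]]
projected = [[0.4, 0.2], [0.8, 0.7]]
change = hsa_change(current, projected)
```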

5. ACKNOWLEDGEMENTS


Juan Zuloaga thanks Liber-Ero – Biodiversity Conservation for funding.

We would also like to thank the researchers who contributed to this work:

Planning:

  • Ciara Raudsepp-Hearne (KBA Canada Coordinator)

Datasets:

  • Chloe Debyser (KBA Canada Technical Coordinator)
  • Randal Green (NatureServe Canada, Technical Coordinator, EBAR Project)
  • Christine Terwissen (NatureServe Canada, National Coordinator, EBAR Project)

HSM reviewers:

  • Maria Leung (Yukon species expert)
  • Richard Schuster (Carleton University)
  • Jacqueline Clare (BC Conservation Data Centre)
  • Varina Crisfield (Sherbrooke University)
  • Guillaume Blanchet (Sherbrooke University)
  • Dave Fraser

Software troubleshooting:

  • Guillaume Larocque (McGill University)
  • Val Lucet (McGill University)
  • Gonzalo Pinilla (CUNY)
